Method and apparatus for determining relative relevance between portions of large electronic documents

ABSTRACT

A technique for determining the relative relevance of electronic documents based on metadata/content associated with the document as a whole and/or metadata/content associated with one or more subdivisions of the electronic document. Metadata is associated with the document and various subdivision markers in the code of the document. A comparison of electronic documents may be made by comparing the metadata/content associated with the document and/or the subdivisions of the document to determine which documents contain subject matter that is relevant to the subject matter of another document or search criteria. The metadata/content may be weighted and these weights may be modified based on a rank profile. A relevance score may be determined based on the comparison of the metadata/content for the documents and/or subdivisions of the documents as well as the weights attributed to the various subdivisions and documents.

BACKGROUND OF THE INVENTION

1. Technical Field

The present invention is generally directed to an improved computing system. More specifically, the present invention is directed to a method and apparatus for determining the relative relevance between portions of large electronic documents.

2. Description of Related Art

With the present information age, access to literature has become increasingly easy to obtain. As literature is moved from a physical format to an electronic format, more people are able to gain access to the information contained in this literature through the use of computers, networks, the Internet, and the like.

Being able to compare literature, e.g., books, articles, magazines, etc., and determine the relevance of one piece of literature to another, has been a valuable tool for identifying other pieces of literature that may be of interest to a reader. Traditionally, this was done in a manual manner such as through a manual cataloging scheme. Typically, these manual cataloging schemes use general topics, author names, title words, and the like, to determine which pieces of literature are most like one another and to categorize them in a similar category.

Manual comparisons are extremely time consuming when the number of documents, e.g., books, being compared is large, and they are usually subject to personal biases. When a cataloging system is utilized, manual comparisons further require a detailed understanding of the cataloging system by the person performing the comparison of the documents so that the appropriate categories for the documents are selected.

In recent years, as literature has been moved from physical books, magazines and the like, to electronic documents, techniques have been devised to perform comparisons of electronic documents based on small standardized portions of the electronic document. For example, electronic documents typically will include an abstract and the comparison between documents is made based on this abstract.

Abstract-based comparisons are extremely unreliable as the entire electronic document, e.g., an electronic book, contains far more information than what is contained in the abstract. Thus, the book may have portions that are applicable to many different other types of books, yet the comparison of abstracts may not accurately reflect this fact. Furthermore, two electronic documents may have the same abstract, yet contain entirely different contents.

Thus, it would be desirable to have an automated system that performs a comprehensive comparison of an electronic document with other electronic documents to generate comparison results indicating the relative relevance of the documents to one another. Moreover, it would be beneficial to provide such a comprehensive comparison with on-line electronic documents as part of a search engine for finding additional electronic documents and provide a ranking of the relative relevance of the additional electronic documents.

SUMMARY OF THE INVENTION

The present invention provides a mechanism for determining the relative relevance of electronic documents based on metadata associated with the document as a whole and/or metadata associated with one or more subdivisions of the electronic document. With the mechanism of the present invention, metadata is associated with the document and various subdivision markers in the code of the document. A comparison of electronic documents may be made by comparing the metadata associated with the document and/or the subdivisions of the document to determine which documents contain subject matter that is relevant to the subject matter of another document or search criteria. In addition, a comparison of the actual content of the document or selected subdivisions of the document may be performed and, along with the comparison of the metadata, a determination as to the relevance of the documents or subdivisions of the documents may be made.

The metadata and/or content associated with the document and/or subdivisions may be provided with default weights that are assigned to the document and/or subdivisions. These default weights are used to calculate a score indicating the relative relevance of the documents to one another.

The default weights may further be modified by weight modifiers provided in a rank profile that may be established by a relative relevance search engine provider or may be customized by users to their specific needs. This rank profile may designate a modifier of weights for the document and/or subdivisions of the document. These modifiers may be, for example, replacement weights, modifiers to the default weights, or the like, for the document and/or subdivisions of the document. The modifiers may be associated with a document and/or subsection type such that paragraphs may be weighted less than chapters which are weighted less than entire documents, for example. In this way, a relevance score may be determined based on the comparison of the metadata and/or content for the documents and/or subdivisions of the documents, the weights associated with the metadata and/or content, as well as the weight modifiers attributed to the various subdivisions.

The scores determined for the documents represent the relative relevance of the documents to the initial or base document or search criteria. The scores may be used to create a ranked list of documents based on their relative relevance. This ranked list may be provided to a user of a client device thereby indicating which documents are more relevant to an initial or base document or search criteria. From this list, a document may be selected for retrieval. The selected document may then be retrieved and presented to a user via the client device.

These and other features and advantages of the present invention will be described in, or will become apparent to those of ordinary skill in the art in view of, the following detailed description of the preferred embodiments.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:

FIG. 1 is an exemplary diagram of a distributed data processing system in which the present invention may be implemented;

FIG. 2 is an exemplary block diagram of a server computing device in which aspects of the present invention may be implemented;

FIG. 3 is an exemplary block diagram of a client computing device in which aspects of the present invention may be implemented;

FIG. 4 is an exemplary block diagram of an electronic document having sections and metadata associated with these sections in accordance with one exemplary embodiment of the present invention;

FIG. 5 is an exemplary message flow in accordance with one exemplary embodiment of the present invention;

FIG. 6 is an exemplary block diagram of a relative relevance search engine in accordance with one exemplary embodiment of the present invention; and

FIG. 7 is a flowchart outlining an exemplary operation of one embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention provides a mechanism for comparing electronic documents based on metadata and content associated with sections of the electronic documents in order to obtain a relative relevance of the electronic documents. Since the present invention is directed to the identification of electronic documents that are most relevant to an initial electronic document or portion of content, the present invention is especially suited to a distributed data processing environment in which there may be a large library of electronic documents available, e.g., the Internet. As such, in order to provide a context for the description of the present invention, FIGS. 1-3 are offered as a brief overview of a distributed data processing environment and some of the computing devices that are part of this distributed data processing environment in which aspects of the present invention may be implemented.

With reference now to the figures, FIG. 1 depicts a pictorial representation of a network of data processing systems in which the present invention may be implemented. Network data processing system 100 is a network of computers in which the present invention may be implemented. Network data processing system 100 contains a network 102, which is the medium used to provide communications links between various devices and computers connected together within network data processing system 100. Network 102 may include connections, such as wire, wireless communication links, or fiber optic cables.

In the depicted example, server 104 is connected to network 102 along with storage unit 106. In addition, clients 108, 110, and 112 are connected to network 102. These clients 108, 110, and 112 may be, for example, personal computers or network computers. In the depicted example, server 104 provides data, such as boot files, operating system images, and applications to clients 108-112. Clients 108, 110, and 112 are clients to server 104. Network data processing system 100 may include additional servers, clients, and other devices not shown. In the depicted example, network data processing system 100 is the Internet with network 102 representing a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, government, educational and other computer systems that route data and messages. Of course, network data processing system 100 also may be implemented as a number of different types of networks, such as for example, an intranet, a local area network (LAN), or a wide area network (WAN). FIG. 1 is intended as an example, and not as an architectural limitation for the present invention.

Referring to FIG. 2, a block diagram of a data processing system that may be implemented as a server, such as server 104 in FIG. 1, is depicted in accordance with a preferred embodiment of the present invention. Data processing system 200 may be a symmetric multiprocessor (SMP) system including a plurality of processors 202 and 204 connected to system bus 206. Alternatively, a single processor system may be employed. Also connected to system bus 206 is memory controller/cache 208, which provides an interface to local memory 209. I/O bus bridge 210 is connected to system bus 206 and provides an interface to I/O bus 212. Memory controller/cache 208 and I/O bus bridge 210 may be integrated as depicted.

Peripheral component interconnect (PCI) bus bridge 214 connected to I/O bus 212 provides an interface to PCI local bus 216. A number of modems may be connected to PCI local bus 216. Typical PCI bus implementations will support four PCI expansion slots or add-in connectors. Communications links to clients 108-112 in FIG. 1 may be provided through modem 218 and network adapter 220 connected to PCI local bus 216 through add-in boards.

Additional PCI bus bridges 222 and 224 provide interfaces for additional PCI local buses 226 and 228, from which additional modems or network adapters may be supported. In this manner, data processing system 200 allows connections to multiple network computers. A memory-mapped graphics adapter 230 and hard disk 232 may also be connected to I/O bus 212 as depicted, either directly or indirectly.

Those of ordinary skill in the art will appreciate that the hardware depicted in FIG. 2 may vary. For example, other peripheral devices, such as optical disk drives and the like, also may be used in addition to or in place of the hardware depicted. The depicted example is not meant to imply architectural limitations with respect to the present invention.

The data processing system depicted in FIG. 2 may be, for example, an IBM eServer pSeries system, a product of International Business Machines Corporation in Armonk, N.Y., running the Advanced Interactive Executive (AIX) operating system or LINUX operating system.

With reference now to FIG. 3, a block diagram illustrating a data processing system is depicted in which the present invention may be implemented. Data processing system 300 is an example of a client computer. Data processing system 300 employs a peripheral component interconnect (PCI) local bus architecture. Although the depicted example employs a PCI bus, other bus architectures such as Accelerated Graphics Port (AGP) and Industry Standard Architecture (ISA) may be used. Processor 302 and main memory 304 are connected to PCI local bus 306 through PCI bridge 308. PCI bridge 308 also may include an integrated memory controller and cache memory for processor 302. Additional connections to PCI local bus 306 may be made through direct component interconnection or through add-in boards. In the depicted example, local area network (LAN) adapter 310, SCSI host bus adapter 312, and expansion bus interface 314 are connected to PCI local bus 306 by direct component connection. In contrast, audio adapter 316, graphics adapter 318, and audio/video adapter 319 are connected to PCI local bus 306 by add-in boards inserted into expansion slots. Expansion bus interface 314 provides a connection for a keyboard and mouse adapter 320, modem 322, and additional memory 324. Small computer system interface (SCSI) host bus adapter 312 provides a connection for hard disk drive 326, tape drive 328, and CD-ROM drive 330. Typical PCI local bus implementations will support three or four PCI expansion slots or add-in connectors.

An operating system runs on processor 302 and is used to coordinate and provide control of various components within data processing system 300 in FIG. 3. The operating system may be a commercially available operating system, such as Windows XP, which is available from Microsoft Corporation. An object oriented programming system such as Java may run in conjunction with the operating system and provide calls to the operating system from Java programs or applications executing on data processing system 300. “Java” is a trademark of Sun Microsystems, Inc. Instructions for the operating system, the object-oriented programming system, and applications or programs are located on storage devices, such as hard disk drive 326, and may be loaded into main memory 304 for execution by processor 302.

Those of ordinary skill in the art will appreciate that the hardware in FIG. 3 may vary depending on the implementation. Other internal hardware or peripheral devices, such as flash read-only memory (ROM), equivalent nonvolatile memory, or optical disk drives and the like, may be used in addition to or in place of the hardware depicted in FIG. 3. Also, the processes of the present invention may be applied to a multiprocessor data processing system.

As another example, data processing system 300 may be a stand-alone system configured to be bootable without relying on some type of network communication interface. As a further example, data processing system 300 may be a personal digital assistant (PDA) device, which is configured with ROM and/or flash ROM in order to provide non-volatile memory for storing operating system files and/or user-generated data.

The depicted example in FIG. 3 and above-described examples are not meant to imply architectural limitations. For example, data processing system 300 also may be a notebook computer or hand held computer in addition to taking the form of a PDA. Data processing system 300 also may be a kiosk or a Web appliance.

The present invention provides a mechanism for determining the relative relevance of electronic documents based on metadata associated with the document as a whole and/or metadata associated with one or more subdivisions of the electronic document. In addition, the content of the documents and/or selected subdivisions of the documents may be compared and, along with the comparisons of the metadata, an overall measure of relative relevance of two or more electronic documents may be determined.

With the mechanism of the present invention, metadata is associated with the document and various subdivision markers in the code of the document. A comparison of electronic documents may be made by comparing the metadata associated with the document and/or the subdivisions of the document to determine which documents contain subject matter that is relevant to the subject matter of another document or search criteria. In addition, the actual content of the document or subdivisions of the document may be compared along with the metadata to determine which documents contain subject matter that is relevant to the subject matter of another document or search criteria.

In a preferred embodiment, subsections of the documents, and portions of metadata associated with the subsections of the documents, have associated default weights that are assigned by a provider of the documents. The total for all weights of subsections and metadata for a document should sum to a standardized value, e.g., 100, 1.0, or the like. These weights are used to determine a relative relevance of the various subsections of the document to the matching criteria.

The weights for the metadata associated with the document and/or subdivisions and the weights for the content of the document and/or subdivisions may be adjusted based on modifiers provided in a rank profile that may be established by a relative relevance search engine provider or may be customized by users to their specific needs. This rank profile may designate the weight modifiers to be adjustments to the weights involved in a relevance comparison, may designate alternative or replacement weights, or the like. The weight modifiers may be associated with a document and/or subsection type such that paragraphs may be weighted less than chapters which are weighted less than entire documents, for example. In this way, a relevance score may be determined based on the weights of the metadata and content for the documents and/or subdivisions of the documents as well as the weight adjustments attributed to the document and/or the various subdivisions and metadata.

The scores determined for the documents represent the relative relevance of the documents to the initial or base document or search criteria. The scores may be used to create a ranked list of documents based on their relative relevance. This ranked list may be provided to a user of a client device thereby indicating which documents are more relevant to an initial or base document or search criteria. From this list, a document may be selected for retrieval. The selected document may then be retrieved and presented to a user via the client device.

With the present invention, electronic documents are created using a markup language, such as Extensible Markup Language (XML), Hypertext Markup Language (HTML), or the like. The code of the electronic document includes tags that designate the subsections of the electronic document. These tags may designate, for example, chapters, sections, pages, paragraphs, etc.

In a preferred embodiment, these electronic documents are large electronic documents such as electronic books, magazines, and the like. However, the present invention is not limited to such. Rather, any electronic document in which subdivisions of the electronic document are designated by tags may be used with the present invention without departing from the spirit and scope of the present invention.

In addition to having these tags designating subdivisions of an electronic document, the present invention provides metadata tags that are to be associated with the electronic document and the subdivision tags. These metadata tags designate characteristics of the subdivision that are to be used when comparing the subdivision to other electronic documents and/or subdivisions of other electronic documents. For example, these metadata tags may designate titles of subdivisions, technologies covered by the subdivisions, keywords associated with the subdivision, main ideas of the subdivision, whether examples or sample code are provided in the subsection, references associated with the subsection, and other metadata identifying the characteristics of the subsection that may be of interest when comparing subsections of documents.

The following is an example of the type of metadata tags that may be included in the markup language code of an electronic document:

    <chapter title="J2EE security"
             technologies_covered="EJB, JSP, JDBC, HTTP, Servlet"
             keywords="security, J2EE, authentication, authorization, SSL">
      Chapter 10 - J2EE Security
      <paragraph main_idea="J2EE and SSL"
                 sample_code_used="yes"
                 references="some referenced">
        J2EE Security comprises many pieces. This version of J2EE has been
        upgraded and improved to provide useful new features . . .
      </paragraph>
    </chapter>

Of course, in a large electronic document, such as an electronic book or magazine, there would be far more text and many more metadata tags associated with subdivisions of the large electronic document. These metadata tags may be provided by a creator of the electronic document, a publisher of the electronic document, or other authority that has access and permission to modify the original code of the electronic document to include these metadata tags and their associated values.
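As an illustration only, the following Python sketch shows one way the metadata attributes of such subdivision tags might be read out of a document. It assumes the markup is well-formed XML (straight quotation marks, no commas between attributes) and that multi-valued attributes are comma-separated; the function name extract_metadata and the SAMPLE string are hypothetical and do not come from the described embodiment.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, modeled on the chapter/paragraph example.
SAMPLE = """<chapter title="J2EE security"
         technologies_covered="EJB, JSP, JDBC, HTTP, Servlet"
         keywords="security, J2EE, authentication, authorization, SSL">
  Chapter 10 - J2EE Security
  <paragraph main_idea="J2EE and SSL" sample_code_used="yes"
             references="some referenced">
    J2EE Security comprises many pieces.
  </paragraph>
</chapter>"""


def extract_metadata(markup):
    """Return a list of (tag, {attribute: [values]}) pairs, one per element.

    Assumption: attribute values holding several items are comma-separated.
    """
    root = ET.fromstring(markup)
    result = []
    for element in root.iter():  # the root first, then its descendants
        attrs = {
            name: [item.strip() for item in value.split(",")]
            for name, value in element.attrib.items()
        }
        result.append((element.tag, attrs))
    return result


metadata = extract_metadata(SAMPLE)
```

Each entry pairs a subdivision tag with its metadata, so metadata for the chapter as a whole and metadata for the paragraph inside it remain separately addressable.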

The metadata tags associated with the section tags of the electronic document are preferably standardized tags that are understandable by a relative relevance search engine, however the present invention is not limited to such. Rather than actually knowing the metadata tags, the relative relevance search engine may make a simple comparison between the names and values of tags of two or more electronic documents or subdivisions of electronic documents to determine those metadata tags that match for two or more electronic documents and/or subdivisions of electronic documents.
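A minimal sketch of such a name-and-value comparison, in which the engine needs no prior knowledge of what the tags mean, might look as follows. The dictionary representation of the metadata and the function name matching_tags are assumptions made for illustration.

```python
def matching_tags(meta_a, meta_b):
    """Return {attribute name: shared values} for attributes present in both
    metadata sets that have at least one value in common.

    Each argument maps an attribute name to a list of values (an assumed
    representation, not one prescribed by the description).
    """
    matches = {}
    for name in meta_a.keys() & meta_b.keys():
        common = set(meta_a[name]) & set(meta_b[name])
        if common:
            matches[name] = common
    return matches


base = {"technologies_covered": ["EJB", "JSP", "JDBC", "HTTP", "Servlet"],
        "keywords": ["security", "J2EE"]}
other = {"technologies_covered": ["EJB", "HTTP", "Servlet"],
         "main_idea": ["J2EE and SSL"]}
shared = matching_tags(base, other)
```

Here only technologies_covered appears in both sets with overlapping values, so it is the only attribute reported as matching.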

FIG. 4 is an exemplary block diagram of an electronic document having sections and metadata associated with these sections in accordance with one exemplary embodiment of the present invention. As shown in FIG. 4, metadata tags, or simply “metadata”, may be associated with various gradations of divisions of an electronic document. Some metadata 410 may be associated with the electronic document 400 as a whole and other metadata 420 may be associated with a plurality of subsections of the electronic document 400. Still further, some metadata 430 may be associated with the contents of a section of the electronic document 400. In addition, some sections of the electronic document 400 may be comprised entirely of metadata 440 associated with contents, with no metadata being associated with the sections as a whole.

Upon a request for documents similar to an initial or base document or to a portion of the initial or base document, or in response to a search request in which search criteria are designated, metadata associated with electronic documents from one or more sources of electronic documents is retrieved. This metadata is then compared to the metadata associated with the initial or base document, the portion of the initial or base document, or the search criteria. Based on this comparison, a score is calculated for each document whose metadata is retrieved and compared to the base document or search criteria based on the weights, or modified weights, attributed to matching portions of the metadata. The scores are then used to generate a ranked list of documents which is returned to a client device.

In addition to, or as an alternative to, the comparison of the metadata of two or more documents, the present invention may perform a direct comparison of the content of the documents, selected portions of the documents, etc. This may be achieved by performing, for example, a literal comparison LCOMP operation on the content. The result would indicate a measure of matching of the two documents, portions of the documents, etc. This measure may then be weighted by an associated weight, which may itself be modified based on modifiers set forth in a rank profile, and used along with the comparison of the metadata to generate a score, as detailed hereafter.
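The literal comparison operation is not specified in further detail here. Purely as a stand-in, the sketch below uses the SequenceMatcher class from Python's standard difflib module to produce a measure of matching between 0.0 and 1.0 for two pieces of content. This is a substituted technique chosen for illustration, not the LCOMP operation itself, and the function name content_match is hypothetical.

```python
from difflib import SequenceMatcher


def content_match(text_a, text_b):
    """Return a similarity measure in [0.0, 1.0] for two pieces of content.

    Assumption: word-level sequence matching is an acceptable stand-in for
    a literal comparison of the content.
    """
    words_a = text_a.split()
    words_b = text_b.split()
    return SequenceMatcher(None, words_a, words_b).ratio()


measure = content_match(
    "J2EE Security comprises many pieces",
    "J2EE Security comprises several pieces",
)
```

The resulting measure could then be multiplied by the weight associated with the compared content in the same way as a metadata correspondence.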

With the present invention a user may enter a relative relevance search request using a client browser application that is augmented to provide a mechanism for entry of relative relevance search criteria. For example, the browser may be enhanced such that a user may view a document via the browser and select a function from a menu requesting other documents meeting certain relative relevance search criteria. These search criteria may be, for example, to find other documents that deal with the same subject matter as the currently displayed document, the currently displayed portion of the document, a portion of the document in which a cursor is presently located, a highlighted word or phrase in the displayed document, or other search criteria that are specifically entered by the user.

For example, a first document may be displayed using the augmented browser. During reading of the first document, a user may determine that a particular topic being covered is of additional interest to the user. As such, the user may, while reading the first document, select an option from a menu or other user interface element, to initiate a relative relevance search for other documents based on the contents of the currently displayed document. For example, the user may select an option from a menu to find other documents that contain similar metadata to the portion of the document currently displayed, the entire document, a paragraph in which the cursor is currently present, or the like.

In response to the selection of one of these options, the client side browser extracts the metadata and/or content from the code for the currently displayed electronic document for the selected portion or portions of the currently displayed electronic document and generates a relative relevance search request based on the extracted metadata and/or content. The client side browser then sends a relative relevance search request to a server through which a search engine service is provided. Alternatively, the client side browser may simply send an identifier of the document and the selected portion or portions of the document as part of the relative relevance search request with the search engine performing the extraction of metadata and/or content for the selected portions of the electronic document at a server.

In either case, the server that receives the relative relevance search request performs a search of other electronic documents registered with the search engine of the server to determine if there are any other documents relevant to the relative relevance search request criteria. These other electronic documents may be provided by one or more electronic document sources. The registration of these electronic documents with the search engine of the server may include, for example, providing the metadata and/or content for select portions of the electronic document to the search engine so that it may be used in determining which other electronic documents are relevant to the relative relevance search request criteria.

Alternatively, the metadata and/or content, or at least a portion of the metadata and/or content, for each document may be retrieved from the document source each time there is a relative relevance search request. In order to reduce the amount of traffic, however, the amount of metadata transferred from the document source to the server in order to perform the relative relevance search may be minimized by initially sending only a first portion of the metadata to the server and sending subsequent portions of the metadata only upon a determination that the already sent portion of the metadata indicates a threshold amount of relevance to the relative relevance search request criteria.
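The staged transfer described above can be sketched as a loop that requests the next portion of metadata only while the portions already received show at least a threshold amount of relevance. In this sketch each list in portions stands in for one transfer from the document source, and relevance is taken to be the fraction of search values matched so far; both choices, and the name staged_fetch, are assumptions made for illustration.

```python
def staged_fetch(query_values, portions, threshold):
    """Consume metadata portions one at a time, stopping early when the
    portions seen so far fall below the relevance threshold.

    Returns (number of portions fetched, fraction of query values matched).
    """
    query = set(query_values)
    if not query:
        return 0, 0.0
    matched = set()
    fetched = 0
    for portion in portions:  # each iteration models one transfer
        fetched += 1
        matched |= query & set(portion)
        if len(matched) / len(query) < threshold:
            break  # too little relevance so far: request nothing further
    return fetched, len(matched) / len(query)


relevant = staged_fetch(["EJB", "JSP", "SSL", "HTTP"],
                        [["EJB", "JSP"], ["SSL"], ["HTTP"]], 0.5)
irrelevant = staged_fetch(["EJB", "JSP", "SSL", "HTTP"],
                          [["cooking"], ["EJB"]], 0.5)
```

In the second call the first portion matches nothing, so the remaining portion is never requested, which is the traffic saving the paragraph describes.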

This iterative process may also be done in embodiments where the metadata is stored locally with regard to the server providing the search engine in order to speed up the search process by quickly “weeding-out” the electronic documents that have no relevance to the current relative relevance search request criteria. However, it should be noted that in order to provide the most comprehensive search, in view of the fact that subdivisions of documents may address topics that are not addressed in other subdivisions or are not the focus of the document as a whole, all of the metadata and/or content for the entire document should be used in the relative relevance search.

Once the metadata and/or content for the initial document or base document is received by the server from the client device, or extracted by the server in response to the relative relevance search request, and the metadata/content for one or more electronic documents is received from a local storage or document source, a comparison of the metadata/content may be performed to determine the relative relevance of the metadata/content of the two electronic documents. A score may then be attributed to the electronic document from the document source or that is represented by the locally stored metadata/content. This score is a measure of the relevance of the electronic document to the selected portion or portions of the initial or base document, or the search criteria entered by the user. The score may then be used to rank the electronic document relative to other electronic documents to indicate which electronic documents are more relevant than others to the selected portion or portions of the initial or base document or the search criteria entered by the user.

The score may be determined in any number of different ways. The following is only an example of how the base score may be calculated and is not intended to assert or imply any limitation on the manner by which a score may be calculated for an electronic document based on the metadata associated with portions of the electronic document.

In one exemplary embodiment, the score for a particular portion of metadata/content may be calculated by determining how many values for the portion of metadata/content match between the metadata/content for the selected portion of the initial document and metadata/content for one or more portions for another electronic document. For example, if the metadata of the initial document includes the attribute “technologies_covered” and the values for this attribute are EJB, JSP, JDBC, HTTP, and Servlet, a determination is made as to whether the metadata for one or more portions of the other electronic document match these values. Thus, if the metadata for a portion of the other electronic document includes the attribute “technologies_covered” and has the values EJB, HTTP, and Servlet, then there are matching values determined to exist. Each matching value may be used to determine a percentage of correspondence between the portions of metadata. For example, since three out of the five terms in the value portion of the metadata match, the percentage of correspondence is 0.60 or 60%.
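The percentage of correspondence in this example can be expressed as the number of matching values divided by the number of values in the initial document's attribute. The following sketch reproduces the three-out-of-five case; the function name correspondence is hypothetical.

```python
def correspondence(base_values, other_values):
    """Fraction of the base document's attribute values that also appear
    among the other document's values for the same attribute."""
    base = set(base_values)
    if not base:
        return 0.0
    return len(base & set(other_values)) / len(base)


ratio = correspondence(
    ["EJB", "JSP", "JDBC", "HTTP", "Servlet"],  # initial document
    ["EJB", "HTTP", "Servlet"],                 # other document
)
```

With the values above, three of the five base values match, giving a correspondence of 0.6.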

As mentioned previously, portions of metadata and/or content are given default weights that are used to represent the relative importance of the various portions of the metadata and/or content. These weights are used along with the measure of correspondence to determine a weight value for the portion of metadata/content. These default weights for portions of metadata/content of a document preferably sum to a standard number for the entire document. For example, all documents may have the sum of their weights equal to 1.0 or 100. Thus, while document A may have a total weight value of 1.0 and document B may have a total weight value of 1.0, the weights attributed to their individual portions of metadata and content may vary within the documents.

In addition, based on a rank profile, different portions of the document may have their default weights modified such that different weights are associated with each portion of the metadata/content than were set by the default weights. This allows a user or search engine provider to reassign weights within documents based on a personal preference of the user and/or search engine provider. This rank profile may be established by the search engine provider or may be a custom rank profile established by a user of a client device and stored in a profile for the user that is associated with the search engine.

In the above example, assume that the default weight for the portion of metadata is 0.3. Using the default weight, the product of the weight and the measure of correspondence, i.e. the score for this portion of metadata, is determined to be 0.18 (i.e. 0.3*0.6). Now assume that a user wishes to modify the default weight and instead assigns a weight of 0.5 to the portion of metadata set forth above. The resulting score for this portion is determined to be 0.30 (i.e. 0.5*0.6).
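The effect of a rank profile on this example may be sketched as follows. The dictionary layout, the names `DEFAULT_WEIGHTS` and `portion_score`, and the treatment of the rank profile as a set of replacement weights are illustrative assumptions; a rank profile could equally hold modifiers applied to the default weights.

```python
# Default weight for the portion of metadata used in the example above.
DEFAULT_WEIGHTS = {"technologies_covered": 0.3}


def portion_score(attribute, correspondence, rank_profile=None):
    """Weight times measure of correspondence for one portion of metadata.

    Assumption: a rank profile maps attribute names to replacement weights
    that override the default weight when present.
    """
    weight = DEFAULT_WEIGHTS[attribute]
    if rank_profile and attribute in rank_profile:
        weight = rank_profile[attribute]
    return weight * correspondence


# Default weight 0.3 with a 0.6 correspondence, as in the text.
default_score = portion_score("technologies_covered", 0.6)

# A user-supplied rank profile raises the weight to 0.5.
custom_score = portion_score(
    "technologies_covered", 0.6, rank_profile={"technologies_covered": 0.5}
)

print(round(default_score, 2), round(custom_score, 2))
```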

The measure of correspondence between the metadata/content of the two documents, the weights associated with the portions of the metadata/content being compared, and the modifiers to these weights are all combined to generate an overall score for the document or portions of the document that are being considered in the comparison to the base document, portions of the base document, or search criteria. The scores for a plurality of documents are then compared to generate the ranked list of documents.

The following is an example of how a weighted score may be calculated for electronic documents using the present invention. The following example is not meant to state or imply any limitation on the manner by which a weighted score may be calculated for electronic documents and is offered only as an example.

With the present invention, the weight of a section, e.g., a subdivision, of an electronic document is equal to the sum of the weight of the section's metadata and the weight of its contents. This is represented as: $\begin{matrix}{W_{section} = {W_{meta} + W_{content}}} & (1)\end{matrix}$

where W_(section) is the weight for the section, W_(meta) is the weight attributed to the metadata of the section, and W_(content) is the weight attributed to the content of the section.

The weight of a section's metadata is equal to the sum of the weights of all the name-value pairs that match. That is, metadata attributes are presented as name-value pairs, e.g., technologies_covered=“EJB”. The name-value pairs that match between the initial or base electronic document metadata for the selected portion and the metadata for the section of the other document increase the weight of the section. If all of the name-value pairs in the metadata for the initial document and the other document match, then the sum of those weights will yield a high section metadata weight.

As mentioned above, there may be a measure of correspondence associated with metadata. That is, the name-value pairs may partially match and thus, the weight used in the following equations may be a product of the weight value attributed to the portion of the metadata and the measure of correspondence.

The equation for determining the weight of the metadata is illustrated by: $\begin{matrix}{W_{meta} = {\sum\limits_{i = 1}^{n}{W_{pair}(i)}}} & (2)\end{matrix}$

where i is the current name-value pair, W_(meta) is the weight attributed to the metadata of the section, W_(pair) is the weight attributed to the name-value pairs, and n is the number of name-value pairs.

The weight of a section's content is equal to the sum of the weights of all of the subsections of that section. Thus, the weight attributed to the content of a section may be found using the following equation: $\begin{matrix}{W_{content} = {\sum\limits_{i = 1}^{p}{W_{subsection}(i)}}} & (3)\end{matrix}$

where W_(content) is the weight of the section attributed to the contents of the section, W_(subsection) is the weight of the metadata associated with the subsections of the section, and p is the number of subsections in the section.

Using the above equations (1), (2) and (3), for any section, the weight of that section may be obtained using the following equation: $\begin{matrix}{W_{section} = {{\sum\limits_{i = 1}^{m}{W_{meta}(i)}} + {\sum\limits_{j = 1}^{p}{W_{subsection}(j)}}}} & (4)\end{matrix}$

where W_(section) is the weight of the section, W_(meta) is the weight of the section metadata, and W_(subsection) is the weight of the metadata for each subsection of the section.

Summing the weights for all of the sections of an electronic document, such as a book, results in a weight for the entire electronic document: $\begin{matrix}\begin{matrix}{W_{doc} = {W_{{doc} - {meta}} + W_{{doc} - {content}}}} \\{= {{\sum\limits_{i = 1}^{m}{W_{{doc} - {meta}}(i)}} + {\sum\limits_{j = 1}^{p}{W_{{doc} - {section}}(j)}}}}\end{matrix} & (5)\end{matrix}$

where W_(doc) is the weight of the entire document, W_(doc-meta) is the weight of the document attributed to the document metadata, W_(doc-content) is the weight of the document attributed to the contents of the document, m is the number of metadata name-value pairs in the document metadata, e.g., the global metadata, p is the number of sections in the document, and W_(doc-section) is the weight of each section of the document. This equation simply states that the weight of the entire electronic document equals the weight of the document's metadata, e.g., the global document metadata, plus the weights of the contents of each section of the electronic document.
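Equations (1) through (5) may be sketched as a short recursive computation. The sketch below is only one possible reading: the `Section` class and its field names are hypothetical, the weights are made-up sample values, and treating the weight of each subsection as that subsection's own section weight (computed recursively) is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    """A document or subdivision; both field names are illustrative."""
    pair_weights: List[float] = field(default_factory=list)   # W_pair(i)
    subsections: List["Section"] = field(default_factory=list)


def meta_weight(section: Section) -> float:
    # Equation (2): sum of the weights of the name-value pairs.
    return sum(section.pair_weights)


def content_weight(section: Section) -> float:
    # Equation (3): sum of the weights of the subsections.
    return sum(section_weight(sub) for sub in section.subsections)


def section_weight(section: Section) -> float:
    # Equation (1): metadata weight plus content weight.
    return meta_weight(section) + content_weight(section)


# A small book: global metadata plus two chapters, one with two subsections.
chapter_1 = Section([0.2, 0.1], [Section([0.1]), Section([0.1])])
chapter_2 = Section([0.3])
book = Section([0.1, 0.1], [chapter_1, chapter_2])

# Equation (5): the document is handled like a top-level section.
total = section_weight(book)
print(round(total, 6))
```

The sample weights were chosen so that the document total comes to the standard sum of 1.0 mentioned earlier.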

It should be noted that the relevance ranking based on weights is not limited to the entire electronic document. Rather, the relevance of individual portions of the electronic document may be determined utilizing the above methodology and rankings provided based only on selected portions of documents.

When comparing any two sections of two electronic documents, the result is the product of the metadata weights of both sections and the comparison of the sections' metadata, plus the product of the content weights of both sections and the comparison of the sections' content. Thus, the comparison of any two sections of two electronic documents may be represented as: $\begin{matrix}{{{Comp}(s_{a},s_{b})} = {{{W_{meta}(a)}*{W_{meta}(b)}*{{Comp}(m_{a},m_{b})}} + {{W_{content}(a)}*{W_{content}(b)}*{{Comp}(c_{a},c_{b})}}}} & (6)\end{matrix}$

where Comp(s_(a),s_(b)) is the comparison of two sections s_(a) and s_(b), W_(meta)(a) is the weight of the metadata for section s_(a), W_(meta)(b) is the weight of the metadata for section s_(b), Comp(m_(a),m_(b)) is the comparison of the metadata for section s_(a) to the metadata for section s_(b), W_(content)(a) is the weight of the content of section s_(a), W_(content)(b) is the weight of the content of section s_(b), and Comp(c_(a),c_(b)) is the comparison of the contents for sections s_(a) and s_(b). By way of example, assume that a first document has a first section s_(a) and a second document has a section s_(b) that is being compared to s_(a). The weight assigned to the metadata for section s_(a) is 0.3 after any adjustments due to a rank profile, if any. The weight assigned to the metadata for section s_(b) is 0.4 after any adjustments due to a rank profile, if any. The comparison of the metadata of section s_(a) to the metadata of section s_(b) results in a Comp value that identifies a measure of correspondence, such as a percentage of the text that matches, e.g., 0.70. Similar values may be provided for the content of the sections s_(a) and s_(b) such that the weights are 0.5 for the content of section s_(a), 0.3 for the content of section s_(b), and the correspondence measure is, for example, 0.60. The result of the above equation would give the following score or measure of relevance: Comp(s_(a),s_(b)) = 0.3*0.4*0.7 + 0.5*0.3*0.6 = 0.174
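The arithmetic of this example can be checked with a direct transcription of equation (6). The function name `comp_sections` and its parameter names are illustrative only.

```python
def comp_sections(w_meta_a, w_meta_b, comp_meta,
                  w_content_a, w_content_b, comp_content):
    """Equation (6): weighted metadata comparison plus weighted content comparison."""
    return (w_meta_a * w_meta_b * comp_meta
            + w_content_a * w_content_b * comp_content)


# Values from the worked example: metadata weights 0.3 and 0.4 with a
# correspondence of 0.70; content weights 0.5 and 0.3 with 0.60.
score = comp_sections(0.3, 0.4, 0.7, 0.5, 0.3, 0.6)
print(round(score, 3))  # 0.174
```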

This value may then be compared to other similarly obtained values for other sections of the same or different documents to determine which sections are more relevant to section s_(a) than others. Alternatively, the Comp values for selected sections of a document may be summed to generate a score for the entire document.

The comparison of two portions of metadata may be performed using the following equation: $\begin{matrix}{{{Comp}(m_{a},m_{b})} = {\sum\limits_{{i = 1},{j = 1}}^{m,n}{{W_{a - {pair}}(i)}*{W_{b - {pair}}(j)}*{{LComp}\lbrack {{n_{a}(i)},{n_{b}(j)}} \rbrack}*{{LComp}\lbrack {{v_{a}(i)},{v_{b}(j)}} \rbrack}}}} & (7)\end{matrix}$

where n_(a) and n_(b) are the metadata names for sections s_(a) and s_(b), v_(a) and v_(b) are the values associated with the names for sections s_(a) and s_(b), LComp is a literal string comparison function that is generally known in the art, W_(a-pair) and W_(b-pair) are the weights associated with the name-value pairs for sections s_(a) and s_(b), m is the number of name-value pairs for section s_(a), and n is the number of name-value pairs for section s_(b).
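Equation (7) may be sketched as a double loop over the name-value pairs of the two sections. The tuple layout `(name, value, weight)` and the sample weights are hypothetical, and `lcomp` is assumed here to return 1.0 for identical strings and 0.0 otherwise; other literal string comparison functions could return a partial measure of correspondence.

```python
def lcomp(x, y):
    """Assumed literal string comparison: 1.0 on exact match, else 0.0."""
    return 1.0 if x == y else 0.0


def comp_meta(pairs_a, pairs_b):
    """Equation (7): compare every pair of section a with every pair of section b.

    Each pair is a (name, value, weight) tuple.
    """
    total = 0.0
    for name_a, value_a, weight_a in pairs_a:
        for name_b, value_b, weight_b in pairs_b:
            total += (weight_a * weight_b
                      * lcomp(name_a, name_b)
                      * lcomp(value_a, value_b))
    return total


# Hypothetical name-value pairs with illustrative weights.
pairs_a = [("technologies_covered", "EJB", 0.5),
           ("technologies_covered", "JSP", 0.5)]
pairs_b = [("technologies_covered", "EJB", 1.0)]

meta_score = comp_meta(pairs_a, pairs_b)
print(meta_score)  # 0.5
```

Only the pair whose name and value both match contributes to the sum, so a pair with a matching name but a different value adds nothing under this exact-match assumption.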

When section s_(a) and section s_(b) have subsections, the comparison value Comp(c_(a),c_(b)) of the contents of sections s_(a) and s_(b) is determined using the following equation: $\begin{matrix}{{{Comp}(c_{a},c_{b})} = {\sum\limits_{{i = 1},{j = 1}}^{m,n}{{W_{a - {section}}(i)}*{W_{b - {section}}(j)}*{{Comp}\lbrack {{s_{a}(i)},{s_{b}(j)}} \rbrack}}}} & (8)\end{matrix}$

where c_(a) is the contents of section s_(a), c_(b) is the contents of section s_(b), W_(a-section) is the weight for the subsections of section s_(a), W_(b-section) is the weight for the subsections of section s_(b), Comp[s_(a)(i), s_(b)(j)] is the comparison value obtained from a comparison of the subsections of section s_(a) and the subsections of section s_(b), m is the number of subsections in section s_(a), and n is the number of subsections in section s_(b).

When section s_(a) and section s_(b) do not have subsections, the comparison value Comp(c_(a),c_(b)) of the contents of sections s_(a) and s_(b) is determined using the following equation: $\begin{matrix}{{{Comp}(c_{a},c_{b})} = {W_{ca}*W_{cb}*{{LComp}\lbrack {c_{a},c_{b}} \rbrack}}} & (9)\end{matrix}$

where c_(a) is the contents of section s_(a), c_(b) is the contents of section s_(b), W_(ca) is the weight of the contents of section s_(a), W_(cb) is the weight of the contents of section s_(b), and LComp[c_(a),c_(b)] is a literal string comparison of the metadata for the contents c_(a) and c_(b).
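Equations (8) and (9) together describe a recursive content comparison, which may be sketched as below. Everything named here is hypothetical: the `Section` class, its fields, the exact-match `lcomp`, and the `content_only` helper, which applies equation (6) with the metadata weights set to zero. The case in which only one of the two sections has subsections is not addressed by the equations; falling back to equation (9) in that case is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Section:
    weight: float            # W_(a-section) or W_(b-section) as a subsection
    content_weight: float    # W_(ca) or W_(cb)
    text: str = ""
    subsections: List["Section"] = field(default_factory=list)


def lcomp(x: str, y: str) -> float:
    """Assumed literal string comparison: 1.0 on exact match, else 0.0."""
    return 1.0 if x == y else 0.0


def comp_content(a: Section, b: Section,
                 comp_section: Callable[[Section, Section], float]) -> float:
    if a.subsections and b.subsections:
        # Equation (8): weighted sum over every pair of subsections.
        return sum(sa.weight * sb.weight * comp_section(sa, sb)
                   for sa in a.subsections for sb in b.subsections)
    # Equation (9): no subsections, compare the contents literally.
    return a.content_weight * b.content_weight * lcomp(a.text, b.text)


def content_only(sa: Section, sb: Section) -> float:
    # Equation (6) with the metadata weights set to zero.
    return (sa.content_weight * sb.content_weight
            * comp_content(sa, sb, content_only))


a = Section(1.0, 1.0, subsections=[Section(0.5, 1.0, "intro"),
                                   Section(0.5, 1.0, "ejb basics")])
b = Section(1.0, 1.0, subsections=[Section(1.0, 1.0, "ejb basics")])

content_score = comp_content(a, b, content_only)
print(content_score)  # 0.5
```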

The comparison of two documents, e.g., documents D_(a) and D_(b), may then be made by using the following equation: $\begin{matrix}{{{Comp}(D_{a},D_{b})} = {{{W_{{doc} - {meta}}(a)}*{W_{{doc} - {meta}}(b)}*{{Comp}(m_{a - {doc}},m_{b - {doc}})}} + {{W_{{doc} - {content}}(a)}*{W_{{doc} - {content}}(b)}*{{Comp}(c_{a - {doc}},c_{b - {doc}})}}}} & (10)\end{matrix}$

The equations above show how the relevance rankings may be produced in accordance with one exemplary embodiment of the present invention. The weights (W) in the equations can be changed based on the weight modifiers or replacement weights designated in the rank profile. The rank profile may be customizable by a user so that an individual rank profile may be generated for each user and stored in association with the search engine of the present invention.

Once scores, e.g. Comp function values, for each other electronic document of interest are calculated, the scores may be ordered based on their values to determine which electronic documents are most relevant to the one or more portions of the initial document selected and identified in the relative relevance search request criteria. The ordered list of documents may then be provided to a client device so that the list may be displayed to a user via a browser or other graphical user interface. The user may then select a document from the list in order to initiate download of the content of the selected document to the client device.

FIG. 5 is an exemplary message flow in accordance with one exemplary embodiment of the present invention. This message flow outlines the process described above. As shown in FIG. 5, a client 510 sends a request for an initial document to a server 520 that provides a search engine service. The server 520 sends a request for the initial document to a document source 540 which then returns the initial document data to the server 520. The server 520 forwards this initial document data to the client device 510 where it is displayed on the client device using a browser application.

At some time later, while viewing the initial document, a user of the client device 510 selects the document, a portion of the document, or enters his/her own relative relevance search criteria using the browser application. The user then initiates a relative relevance search request which is sent to the server 520. The server 520 forwards the relative relevance search request to the relative relevance search engine 530. The relative relevance search engine 530 requests metadata and/or content information for registered electronic documents from the document source 540. As mentioned above, in an alternative embodiment, the metadata and/or content for these documents may be stored locally so that it need not be requested from the document source 540.

The document source 540 returns the metadata and/or content for the registered electronic documents to the relative relevance search engine 530. As previously mentioned, in some embodiments, this may be an iterative operation in which small portions of the metadata and/or content are transmitted with subsequent transmissions only occurring with regard to a particular document if it is determined that the document is relevant to the current relative relevance search request criteria.

The metadata/content that was retrieved for the documents is compared to the metadata/content for the document, portion or portions of the document, or the search criteria submitted in the relative relevance search request. Score values are determined for each of the documents based on their metadata and/or content and the weights associated with the metadata/content. The documents are then ordered based on the values of the scores. In this way, the documents are ranked in accordance with their relevance to the initial document, the selected portions of the initial document, or the search criteria entered by the user.

The ranked list of documents is provided to the server 520 which then forwards the list to the client device 510. The user of the client device 510 may select a document from the list to thereby initiate download of the data corresponding to the selected document. As a result, a request for the selected document is sent from the server 520 to the document source 540 which returns the selected document data to the server 520. The server 520 then forwards the data to the client device 510 where the selected document is displayed to the user.

FIG. 6 is an exemplary block diagram of a relative relevance search engine in accordance with one exemplary embodiment of the present invention. The elements shown in FIG. 6 may be implemented as hardware, software, or any combination of hardware and software. In a preferred embodiment, the elements of FIG. 6 are implemented as software instructions executed by one or more data processing devices.

As shown in FIG. 6, the relative relevance search engine includes a controller 610, a network interface 620, a storage interface 630, a comparison module 640, and a ranking module 650. The elements 610-650 are in communication with one another via the control/data signal bus 660. Although a bus architecture is shown in FIG. 6, the present invention is not limited to such and any architecture that facilitates the communication of control/data signals between the elements 610-650 may be used without departing from the spirit and scope of the present invention.

The controller 610 controls the overall operation of the relative relevance search engine and orchestrates the operation of the other elements 620-650. The network interface 620 provides a communication interface through which relative relevance search requests may be received from client devices, requests for document metadata/content may be sent, document metadata/content may be received, and the results of the relative relevance searches may be sent to client devices.

The storage interface 630 provides a communication interface for storing metadata/content associated with documents in a storage device 670. This may be metadata/content that is stored temporarily in order to facilitate the comparisons of metadata and/or content for portions of the electronic documents or a more permanent storage of metadata/content for later retrieval in order to perform such comparisons. In either case, the storing and retrieval of metadata/content to and from the storage device 670, either on a temporary or more permanent basis, is performed via the storage interface 630.

The comparison module 640 performs the functions previously described for comparing the metadata and/or content associated with the documents and subdivisions of the documents in order to calculate a score for the subdivisions of the document and/or the documents themselves. The comparison module 640 compares the metadata and/or content retrieved for one or more electronic documents against the metadata and/or content designated in the relative relevance search request for an initial or base document, one or more selected portions of an initial or base document, or search criteria explicitly entered by a user of a client device that sent the relative relevance search request. Based on the comparison, and weight values associated with the metadata/content, scores are calculated for the one or more documents. These scores are then provided to the ranking module 650.

The ranking module 650 ranks the one or more documents based on the scores. For example, a greatest to least value listing may be generated where documents that are most relevant to the initial document, one or more selected portions of the initial document, or the search criteria, are listed first in the ranked list. Other organizations of the ranked list, such as least relevant to most relevant, may be utilized without departing from the spirit and scope of the present invention.
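The ordering performed by a ranking module of this kind may be sketched as a sort over document scores. The function name `rank_documents`, the document identifiers, and the score values are all hypothetical.

```python
def rank_documents(scores, most_relevant_first=True):
    """Order (document, score) pairs by score.

    most_relevant_first=True gives a greatest-to-least listing; False gives
    the alternative least-relevant-to-most-relevant organization.
    """
    return sorted(scores.items(),
                  key=lambda item: item[1],
                  reverse=most_relevant_first)


# Hypothetical scores produced by a comparison step.
scores = {"doc-a": 0.174, "doc-b": 0.42, "doc-c": 0.05}

ranked = rank_documents(scores)
print([doc for doc, _ in ranked])  # ['doc-b', 'doc-a', 'doc-c']
```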

The controller 610 may receive the ranked list from the ranking module 650 and send the ranked list to a source of the relative relevance search request via the network interface 620. In this way, the user of the client device that transmitted the relative relevance search request is presented with a listing of documents in relevance order as determined based on a weighted comparison of the metadata of the documents to the initial document, portions of the initial document, or search criteria entered by the user.

FIG. 7 is a flowchart outlining an exemplary operation of one embodiment of the present invention. It will be understood that each block of the flowchart illustration, and combinations of blocks in the flowchart illustration, can be implemented by computer program instructions. These computer program instructions may be provided to a processor or other programmable data processing apparatus to produce a machine, such that the instructions which execute on the processor or other programmable data processing apparatus create means for implementing the functions specified in the flowchart block or blocks. These computer program instructions may also be stored in a computer-readable memory or storage medium that can direct a processor or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory or storage medium produce an article of manufacture including instruction means which implement the functions specified in the flowchart block or blocks.

Accordingly, blocks of the flowchart illustration support combinations of means for performing the specified functions, combinations of steps for performing the specified functions and program instruction means for performing the specified functions. It will also be understood that each block of the flowchart illustration, and combinations of blocks in the flowchart illustration, can be implemented by special purpose hardware-based computer systems which perform the specified functions or steps, or by combinations of special purpose hardware and computer instructions.

As shown in FIG. 7, the operation starts by receiving a relative relevance search request in which search criteria are designated (step 710). These search criteria may be the metadata and/or content for an entire initial document, metadata and/or content associated with one or more selected portions of an initial document, or search criteria specifically entered by a user, for example.

Metadata/content for other electronic documents is then retrieved (step 720). The metadata/content for the other electronic documents is compared to the search criteria (step 730) and a score is generated for each of the other electronic documents (step 740). As described above, the calculation of a score may involve comparing the metadata/content of each other electronic document to the metadata/content of the initial electronic document, one or more selected portions of the initial electronic document, or user entered metadata/content type search criteria, and generating a score based on the weights associated with these portions of metadata/content and a measure of the correspondence between these portions of metadata/content. This may be done based on the metadata/content for the entire document, the metadata/content for individual sections, and/or the metadata/content for individual subsections, as discussed above. The weighted values obtained for each selected section and subsection of the documents may be summed to arrive at a score for a section of the electronic document and/or for the electronic document as a whole.

The other electronic documents are then ranked based on the calculated scores (step 750). The ranked list is then returned as the results of the relative relevance search (step 760) and the operation terminates.
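The flow of steps 710 through 760 may be sketched end to end for the metadata-only case. The function name, the dictionary layouts, the document identifiers, and the use of a value-overlap fraction as the measure of correspondence are all assumptions made for illustration; retrieval is simulated by passing already-loaded metadata.

```python
def relative_relevance_search(criteria, documents, weights):
    """Steps 710-760 for metadata only.

    criteria:  attribute name -> list of values (step 710)
    documents: document id -> {attribute name -> list of values} (step 720)
    weights:   attribute name -> weight, after any rank profile adjustments
    """
    scored = []
    for doc_id, metadata in documents.items():
        score = 0.0
        for attribute, wanted in criteria.items():
            wanted_set = set(wanted)
            if not wanted_set:
                continue
            # Step 730: compare the document's values to the search criteria.
            matches = wanted_set & set(metadata.get(attribute, []))
            # Step 740: weight times measure of correspondence.
            score += weights.get(attribute, 0.0) * len(matches) / len(wanted_set)
        scored.append((doc_id, score))
    # Step 750: rank, most relevant first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Step 760: return the ranked list as the search result.
    return scored


criteria = {"technologies_covered": ["EJB", "JSP", "JDBC", "HTTP", "Servlet"]}
documents = {
    "book-2": {"technologies_covered": ["JSP"]},
    "book-1": {"technologies_covered": ["EJB", "HTTP", "Servlet"]},
}
weights = {"technologies_covered": 0.3}

results = relative_relevance_search(criteria, documents, weights)
print([doc_id for doc_id, _ in results])  # ['book-1', 'book-2']
```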

Thus, the present invention provides a mechanism by which portions of a document may be selected and other documents relevant to the selected portions of the document may be identified. The identification of these other documents is based on a measure of the correspondence of metadata and/or content associated with the documents, weights associated with the metadata/content, and modifications to these weights provided in a rank profile. In this way, various granularities of a document may be used to identify other documents of interest to a user. The identification of the other documents may be based on an analysis of the entire document or portions of the document rather than merely being based on an abstract of the document. In this way, a more accurate identification of relevant documents is achieved than is achievable by known search mechanisms.

It should be noted that while the present invention has been described in terms of both the metadata and the content being compared between documents in order to arrive at a score for the document, the present invention is not limited to such embodiments. Rather, the present invention may compare only metadata or only content without departing from the spirit and scope of the present invention. In such embodiments, for example, the weights associated with metadata or the weights associated with content in the equations set forth above may be set to zero in order to eliminate these portions of the equations from influencing the resulting score. The result is a set of equations that takes into account either only the comparison of the metadata or only the comparison of the content. Alternatively, the equations themselves may be modified to eliminate the quantities associated with either metadata or content depending on the embodiment.

It is important to note that while the present invention has been described in the context of a fully functioning data processing system, those of ordinary skill in the art will appreciate that the processes of the present invention are capable of being distributed in the form of a computer readable medium of instructions and a variety of forms and that the present invention applies equally regardless of the particular type of signal bearing media actually used to carry out the distribution. Examples of computer readable media include recordable-type media, such as a floppy disk, a hard disk drive, a RAM, CD-ROMs, DVD-ROMs, and transmission-type media, such as digital and analog communications links, wired or wireless communications links using transmission forms, such as, for example, radio frequency and light wave transmissions. The computer readable media may take the form of coded formats that are decoded for actual use in a particular data processing system.

The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

1. A method, in a data processing system, of identifying relevantdocuments to a portion of a first document, comprising: receiving anidentification of a portion of a first document from a client device;identifying metadata associated with the portion of the first document;retrieving metadata for a portion of a second document; comparing themetadata associated with the portion of the first document to themetadata for the portion of the second document; comparing content ofthe portion of the first document to content of the portion of thesecond document; and determining a relative relevance of the portion ofthe second document to the portion of the first document based on themetadata and content comparison.
 2. The method of claim 1, whereindetermining a relative relevance of the portion of the second documentto the portion of the first document includes: generating acorrespondence value based on the comparison of the metadata associatedwith the portion of the first document to the metadata associated withthe portion of the second document; and generating a score value for theportion of the second document based on the correspondence value for theportion of the second document.
 3. The method of claim 2, furthercomprising: ranking the second document relative to other documents thathave been compared to the first document based on the generated scorevalue to obtain a ranked list of documents; and providing the rankedlist to the client device.
 4. The method of claim 2, further comprising:generating a content value for the portion of the second document basedon the comparison of the content, wherein the correspondence value andthe content value for the portion of the second document are combined toobtain the generated score value.
 5. The method of claim 1, wherein theportion of the first document is one of the entire first document, achapter of the first document, a paragraph of the first document, asubdivision of the first document, and a search term.
 6. The method ofclaim 1, wherein the portion of the second document is at least one ofan entire portion of the one or more second documents, a chapter of theone or more second documents, a paragraph of the one or more seconddocuments, and a subdivision of the one or more second documents.
 7. Themethod of claim 1, wherein one or more first weights are associated withthe portion of the first document, one or more second weights areassociated with the portion of the second document, and whereingenerating a score value for the portion of the second document based onthe correspondence value for the portion of the second document includesapplying the one or more first weights and the one or more secondweights to the correspondence value.
 8. The method of claim 4, whereinone or more first weights are associated with the portion of the firstdocument, one or more second weights are associated with the portion ofthe second document, and wherein generating a score value for theportion of the second document based on the correspondence value and thecontent value for the portion of the second document includes applyingthe one or more first weights and the one or more second weights to thecorrespondence value and the content value.
 9. (canceled)
 10. The methodof claim 8, further comprising: retrieving one or more weight modifiersfrom a profile; and applying the one or more weight modifiers to one ormore of the first weights and the second weights.
 11. A computer program product in a computer readable medium for identifying documents relevant to a portion of a first document, comprising: first instructions for receiving an identification of a portion of a first document from a client device; second instructions for identifying metadata associated with the portion of the first document; third instructions for retrieving metadata for a portion of a second document; fourth instructions for comparing the metadata associated with the portion of the first document to the metadata for the portion of the second document; fifth instructions for comparing content of the portion of the first document to content of the portion of the second document; and sixth instructions for determining a relative relevance of the portion of the second document to the portion of the first document based on the metadata and content comparison.
 12. The computer program product of claim 11, wherein the sixth instructions for determining a relative relevance of the portion of the second document to the portion of the first document include: instructions for generating a correspondence value based on the comparison of the metadata associated with the portion of the first document to the metadata associated with the portion of the second document; and instructions for generating a score value for the portion of the second document based on the correspondence value for the portion of the second document.
 13. The computer programproduct of claim 12, further comprising: instructions for ranking thesecond document relative to other documents that have been compared tothe first document based on the generated score value to obtain a rankedlist of documents; and instructions for providing the ranked list to theclient device.
 14. The computer program product of claim 12, further comprising: instructions for generating a content value for the portion of the second document based on the comparison of the content, wherein the correspondence value and the content value for the portion of the second document are combined to obtain the generated score value.
 15. The computer program product of claim 11, wherein the portion of the first document is one of the entire first document, a chapter of the first document, a paragraph of the first document, a subdivision of the first document, and a search term.
 16. The computer program product of claim 11, wherein the portion of the second document is at least one of an entire portion of the one or more second documents, a chapter of the one or more second documents, a paragraph of the one or more second documents, and a subdivision of the one or more second documents.
 17. The computer program product of claim 11, wherein one or more first weights are associated with the portion of the first document, one or more second weights are associated with the portion of the second document, and wherein the instructions for generating a score value for the portion of the second document based on the correspondence value for the portion of the second document include instructions for applying the one or more first weights and the one or more second weights to the correspondence value.
 18. The computer program product of claim 14,wherein one or more first weights are associated with the portion of thefirst document, one or more second weights are associated with theportion of the second document, and wherein the instructions forgenerating a score value for the portion of the second document based onthe correspondence value and the content value for the portion of thesecond document include instructions for applying the one or more firstweights and the one or more second weights to the correspondence valueand the content value.
 19. The computer program product of claim 18,further comprising: instructions for retrieving one or more weightmodifiers from a profile; and instructions for applying the one or moreweight modifiers to one or more of the first weights and the secondweights.
 20. An apparatus for identifying documents relevant to aportion of a first document, comprising: means for receiving anidentification of a portion of a first document from a client device;means for identifying metadata associated with the portion of the firstdocument; means for retrieving metadata for a portion of a seconddocument; means for comparing the metadata associated with the portionof the first document to the metadata for the portion of the seconddocument; means for comparing content of the portion of the firstdocument to content of the portion of the second document; and means fordetermining a relative relevance of the portion of the second documentto the portion of the first document based on the metadata and contentcomparison.